# load the file
import json

with open('speeches.json', 'r') as file:
    speeches = json.load(file)
Understanding US Presidential Speeches from 1900 till Today
Speech Topics and Semantic Similarity Over Time
1 Introduction
Much criticism of then-Candidate, now-President Donald Trump has centered on his being unfit for the Office of President of the United States, or on his “unpresidentiality”. Some of this criticism stems from Donald Trump’s rhetorical style, which has likewise been deemed “unpresidential”. But what does “Presidentiality” mean? Are there common traits, character qualities, rhetorical styles, or other elements shared by US Presidents?
While these are valuable and interesting questions, this memo addresses two related questions in detail: whether certain topics are common to US Presidential speeches over time, and whether US Presidents have given speeches and addresses in ways similar to one another.
2 Data
In order to answer these two questions, I use data on Presidential speeches and addresses made publicly available by the Miller Center at the University of Virginia. Their collection contains, in text form, addresses and speeches given by US Presidents from George Washington to the present. While the collection is not exhaustive, it is extensive, containing over 1,000 speeches. The data is available to the public as a JSON file in which each record holds a textual transcription of a speech, the date it was given, the President who delivered it, and its title.
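For illustration, a single record in the file looks roughly like the following; the field names match those used in the Appendix, while the values here are invented:

{
    "president": "Theodore Roosevelt",
    "date": "1901-12-03",
    "title": "First Annual Message",
    "doc_name": "first-annual-message",
    "transcript": "To the Senate and House of Representatives: ..."
}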
3 Methodology
3.1 Selection of the Data
To account for temporal shifts in American society and realignments in domestic and foreign policy goals, I decided to limit the data to speeches from Presidents of the 20th Century onward, beginning with Theodore Roosevelt’s Presidency in 1901. This allows us to capture the historical trend of topics relevant today without skewing the data toward topics that may have been more relevant before the 20th Century but are less so today. In this way, the data gains temporal breadth while maintaining relevance for our current world.
Additionally, only speeches given by a President while they were in office are kept within the dataset; any speeches given while holding other political offices or while campaigning for the Presidency are removed, to maintain consistency across all Presidents within the dataset.
Finally, only minimal text cleaning operations were applied to the speeches, in order to maintain semantic and contextual coherence of the speeches and to optimize the text available for the topic modelling algorithm.
3.2 Analysis of Topics In Speeches
BERTopic Analysis
The first method I use is BERTopic analysis, which identifies the topics found within the speeches. BERTopic is a machine learning technique that groups similar texts into topics by analyzing patterns in language. It uses pretrained language models to represent meaning, then clusters related texts together. This helps identify key themes across large sets of documents, such as these speeches, without needing manual categorization.
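As a minimal sketch of that workflow (the full pipeline, including text chunking and a seeded UMAP model, appears in Appendix Part 3):

from bertopic import BERTopic

# docs_chunked is the list of 300-word speech chunks built in the Appendix
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs_chunked)  # one topic id per chunk
topic_model.get_topic_info().head()  # keywords describing each identified topic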
This graph shows the top five topics identified by BERTopic in the speeches, with the year on the x-axis. The y-axis shows the proportion of speech segments in each year in which each topic was identified; using proportions accounts for the fact that some years in our dataset contain more speeches than others.
The graph below follows the same x- and y-axis layout as above, only this time it is split to show each topic’s prevalence individually.
Visualizing these graphs reveals several important lessons:
- The top five topics in US Presidential speeches of the 20th and 21st Centuries are:
- Banking and Gold
- Health Care
- War and Peace
- Civil Rights and Racial Discourse
- The Vietnam War
- The topics spoken about by US Presidents are contingent upon changes within the domestic and international political systems.
- For example, discussions of Banking and Gold peaked around the Great Depression, when the US left the Gold Standard, and again during the 2008 Financial Crisis, but decline at other times.
- Discussions of Vietnam were not significantly prevalent before the Vietnam War, nor were they significantly present after it.
- The topics of Civil Rights and racial discourse and of War and Peace recur consistently over the years, showing that while both concerns have persisted at the forefront of discussion in the highest Office of the United States, concerns of international war and peace are more prevalent and important to US Presidents.
Cosine Similarity
The cosine of the angle \(\theta\) between two vectors (documents) \(a\) and \(b\) can be defined as:
\[ \cos(\theta) = \frac{a \cdot b}{\|a\| \|b\|} \] where \({\|a\|}\) and \({\|b\|}\) are the magnitudes of vectors \(a\) and \(b\).
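As a quick worked example of this formula (with two small vectors invented for illustration):

import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# dot product divided by the product of the magnitudes
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_theta, 3))  # 0.73: the vectors point in broadly similar directions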
Cosine Similarity with Word Embeddings
Cosine similarity is a technique used to measure how similar two pieces of text are by comparing the angle between their vector representations in a multi-dimensional space. When combined with word embeddings (mathematical representations of words that capture their meanings and contexts), this method allows us to compare texts based on their semantic similarity. The word embeddings generated by BERTopic capture the nuances of language beyond simply matching words, which is useful for comparing the meaning of sentences.
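A minimal sketch of embedding-based similarity, assuming the sentence-transformers model "all-MiniLM-L6-v2" (BERTopic's default English embedding model); the two sentences are invented paraphrases of each other:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "We must defend the nation against foreign aggression.",
    "Our country has to be protected from attacks abroad.",
]
embeddings = model.encode(sentences)
# high similarity despite almost no shared vocabulary
print(cosine_similarity([embeddings[0]], [embeddings[1]])[0, 0])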
This graph shows the cosine similarity score of each President’s speeches using the word embeddings method. A darker square equates to a higher score and a lighter square signifies a lower score. Presidents are listed on both the x and y axes in ascending chronological order, and each square compares one President’s speeches with another’s. The black squares running along the diagonal signify a score of 1.0, where a President is compared against himself.
This graph shows that most Presidents from Calvin Coolidge to Bill Clinton, with the exception of Lyndon B. Johnson, spoke about content similar to one another’s. However, Presidents from George W. Bush onward have spoken on topics more thematically similar to each other than to previous Presidents. It also shows that Donald Trump has the lowest level of similarity to the Presidents before him, a break from the historical norm.
Cosine Similarity with TF-IDF
Using TF-IDF (Term Frequency–Inverse Document Frequency) cosine similarity, texts are compared based on how often words appear in them, adjusted for how common those words are across all documents. However, TF-IDF does not capture deeper meaning: two texts might mean the same thing but use different words, and TF-IDF would consider them dissimilar. TF-IDF is advantageous for identifying similar vocabulary, while embeddings are better for comparing similar ideas or tones.
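A minimal sketch with scikit-learn's TfidfVectorizer illustrates this limitation; the same paraphrased pair from the embedding example above shares almost no vocabulary, so TF-IDF scores it as dissimilar even though the meaning is close:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "We must defend the nation against foreign aggression.",
    "Our country has to be protected from attacks abroad.",
]
tfidf = TfidfVectorizer().fit_transform(texts)
# near 0: almost no overlapping words between the two texts
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])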
This graph shows the cosine similarity score of each President’s speeches using the TF-IDF method. As before, a darker square equates to a higher score and a lighter square to a lower one; Presidents are listed on both axes in ascending chronological order, and the black squares along the diagonal signify a score of 1.0, where a President is compared against himself.
This graph shows that most Presidents from Calvin Coolidge to Bill Clinton, with the exception of Gerald Ford, spoke using vocabulary similar to one another’s. However, Presidents from George W. Bush onward have spoken with vocabulary more similar to each other than to the Presidents who came before. Intriguingly, Joe Biden has had higher levels of terminological consistency with past Presidents than his fellow recent Presidents. Since both Gerald Ford and Donald Trump appear to have lower cosine scores than their fellow Presidents, running a simple line of code reveals that where Ford has an average similarity score of 0.56, Trump has an average similarity score of 0.62. Thus, Gerald Ford was verbally less similar to other Presidents than Donald Trump was.
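That “simple line of code” is, roughly, the mean of a President’s row in the similarity matrix with the self-comparison on the diagonal excluded; a sketch against the similarity_df_2 DataFrame built in the Appendix:

def avg_similarity(df, president):
    # mean similarity of one president to every other president (self excluded)
    return df.loc[president].drop(president).mean()

avg_similarity(similarity_df_2, "Gerald Ford")   # ~0.56
avg_similarity(similarity_df_2, "Donald Trump")  # ~0.62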
4 Conclusion
This analysis reveals deeper insights into the nature of the US Presidency, particularly the public speaking aspect of the highest office of the land.
By applying BERTopic analysis and cosine similarity with word embeddings and TF-IDF, this study reveals three main findings.
Firstly, the proportion of topic prevalence in Presidential speeches is time dependent: it fluctuates in accordance with changes in the political sphere, whether global or domestic.
Secondly, in both content and vocabulary, there is a similarity among Presidents from Coolidge until Clinton, after which there is a break and a new set of similarities begins.
Thirdly, while Donald Trump’s rhetorical choices have been criticized as a great break from previous Presidents, this is only verifiable in terms of topic and thematic consistency, not in terms of vocabulary.
5 Appendix
This is a technical appendix for the operations performed to create this memo.
Part 1: Loading the Data
# convert to a pandas DataFrame
import pandas as pd

df = pd.json_normalize(speeches)
Part 2: Cleaning and Organizing the Text
# We only want to keep Presidents who start from 1900 on and drop all others
keep_presidents = [
    "Theodore Roosevelt", "William Taft", "Woodrow Wilson",
    "Warren G. Harding", "Calvin Coolidge", "Herbert Hoover",
    "Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower",
    "John F. Kennedy", "Lyndon B. Johnson", "Richard M. Nixon",
    "Gerald Ford", "Jimmy Carter", "Ronald Reagan",
    "George H. W. Bush", "Bill Clinton", "George W. Bush",
    "Barack Obama", "Donald Trump", "Joe Biden"
]
df_new = df[df["president"].isin(keep_presidents)]
df_new = df_new.drop(['doc_name', 'title'], axis=1)
# Chronological List
"date", inplace=True)
df_new.sort_values(=True, inplace=True) df_new.reset_index(drop
# Define the dates on which each President came to office and left office.
# Each value is a list of (start, end) terms, so a President with
# non-consecutive terms (Donald Trump) keeps both terms rather than one
# duplicate dictionary key silently overwriting the other.
president_terms = {
    "Theodore Roosevelt": [("1901-09-14", "1909-03-04")],
    "William Taft": [("1909-03-04", "1913-03-04")],
    "Woodrow Wilson": [("1913-03-04", "1921-03-04")],
    "Warren G. Harding": [("1921-03-04", "1923-08-02")],
    "Calvin Coolidge": [("1923-08-02", "1929-03-04")],
    "Herbert Hoover": [("1929-03-04", "1933-03-04")],
    "Franklin D. Roosevelt": [("1933-03-04", "1945-04-12")],
    "Harry S. Truman": [("1945-04-12", "1953-01-20")],
    "Dwight D. Eisenhower": [("1953-01-20", "1961-01-20")],
    "John F. Kennedy": [("1961-01-20", "1963-11-22")],
    "Lyndon B. Johnson": [("1963-11-22", "1969-01-20")],
    "Richard M. Nixon": [("1969-01-20", "1974-08-09")],
    "Gerald Ford": [("1974-08-09", "1977-01-20")],
    "Jimmy Carter": [("1977-01-20", "1981-01-20")],
    "Ronald Reagan": [("1981-01-20", "1989-01-20")],
    "George H. W. Bush": [("1989-01-20", "1993-01-20")],
    "Bill Clinton": [("1993-01-20", "2001-01-20")],
    "George W. Bush": [("2001-01-20", "2009-01-20")],
    "Barack Obama": [("2009-01-20", "2017-01-20")],
    "Donald Trump": [("2017-01-20", "2021-01-20"), ("2025-01-20", "2025-04-27")],
    "Joe Biden": [("2021-01-20", "2025-01-20")],
}
df_new['date'] = pd.to_datetime(df_new['date'], format='ISO8601', utc=True, errors='coerce')
df_new = df_new.dropna(subset=['date'])  # drop any NA dates
df_new['date'] = df_new['date'].dt.date
# Update the dictionary so each term's start and end are in a clean date format
for pres, terms in president_terms.items():
    president_terms[pres] = [
        (pd.to_datetime(start).date(),
         pd.to_datetime(end).date() if end else pd.Timestamp.today().date())
        for start, end in terms
    ]
# Drop speeches given by a president while they were not actively in office,
# ensuring only presidential speeches remain in our data
def was_president_at_time(row):
    pres = row['president']
    date = row['date']
    if pres in president_terms:
        return any(start <= date <= end for start, end in president_terms[pres])
    return False

df_proper = df_new[df_new.apply(was_president_at_time, axis=1)].reset_index(drop=True)
Part 3: BERTopic Analysis
# Import the relevant libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Make the text of each speech one continuous string, with no returns or line breaks
def clean_text(text):
    text = text.replace('\n', ' ')
    return text.strip()

df_proper['cleaned_text'] = df_proper['transcript'].apply(clean_text)
# Split each speech into chunks of 300 words
def chunk_text(text, max_words=300):
    words = text.split()
    return [' '.join(words[i:i+max_words]) for i in range(0, len(words), max_words)]

df_proper['chunks'] = df_proper['cleaned_text'].apply(chunk_text)

# Create a flat list of all chunks to analyze
docs_chunked = [chunk for chunks in df_proper['chunks'] for chunk in chunks]
# Add a speech id index
df_proper = df_proper.reset_index(drop=True)
df_proper['speech_id'] = df_proper.index + 1
# Initialize the BERTopic model and apply it to the list of chunks
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

umap_model = UMAP(random_state=42)

#embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
#vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    #embedding_model=embedding_model,
    #vectorizer_model=vectorizer_model,  # additional models that could be used
    calculate_probabilities=True,
    verbose=False,
    umap_model=umap_model,
    #top_n_words=7,
    #nr_topics="auto",
)

topics, probs = topic_model.fit_transform(docs_chunked)
# Get information on the topics identified by BERTopic: counts for each topic and associated keywords
topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]
topic_info_simple = topic_info[['Topic', 'Count', 'Representation']]

# Deduplicate and join the keywords together for better readability
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(
    lambda x: ' '.join(dict.fromkeys(x).keys())
)

frequency_table = pd.DataFrame(topic_info_simple)
# Create a new DataFrame of just the chunked text with the requisite column names, fashioned like our old DataFrame
chunked_data = []

for idx, row in df_proper.iterrows():
    chunks = chunk_text(row['cleaned_text'], max_words=300)
    for chunk in chunks:
        chunked_data.append({
            "original_speech_id": row['speech_id'],
            "president": row['president'],
            "date": row['date'],
            "transcript": chunk,
        })

df_chunked = pd.DataFrame(chunked_data)
# Create topic column that attaches associated topics to each chunk
df_chunked['topic'] = topics
# Remove Topic -1 (the outlier topic, with keywords such as 'the', 'of', 'a', etc.)
# and reassign those chunks to the next most probable topic
## Ensures that all speech chunks carry their best associated topic rather than this catch-all of extra words
excluded_topics = [-1]

def reassign_topic(topic, prob_row):
    if topic in excluded_topics:
        sorted_indices = np.argsort(prob_row)[::-1]
        for idx in sorted_indices:
            if idx not in excluded_topics:
                return idx
        return topic
    else:
        return topic

df_chunked["topic"] = [
    reassign_topic(t, p) for t, p in zip(df_chunked["topic"], probs)
]
# Map topic labels onto each chunk for better visualization,
# and attach the year a speech was given to each row
topic_labels = {
    row["Topic"]: row["Representation"]
    for _, row in topic_info_simple.iterrows()
}

df_chunked["topic_label"] = df_chunked["topic"].map(topic_labels)
df_chunked['year'] = pd.to_datetime(df_chunked['date']).dt.year
Data Visualization
# Create a DataFrame of the top 5 most prevalent topics
top_5_topics = df_chunked['topic'].value_counts().head(5).index
df_top_5 = df_chunked[df_chunked['topic'].isin(top_5_topics)]
# Count the number of speech chunks for each topic by year
df_count_by_year = df_top_5.groupby(['year', 'topic']).size().reset_index(name='count')

df_total_by_year = df_chunked.groupby('year').size().reset_index(name='total')
df_count_by_year = pd.merge(df_count_by_year, df_total_by_year, on='year')
# Compute the proportion: chunks on each topic divided by total chunks that year
## Accounts for years with more or fewer speeches
df_count_by_year['proportion'] = df_count_by_year['count'] / df_count_by_year['total']
df_count_by_year['topic_labels'] = df_count_by_year["topic"].map(topic_labels)
# Create summary labels for each of the following topics for easier visualization on a graph
manual_labels = {
    0: "Vietnam War",
    1: "Health Care",
    7: "Banks, Credit, Gold",
    2: "Peace, Nations, War",
    3: "Rights, Blacks, White",
}

df_count_by_year['manual_labels'] = df_count_by_year["topic"].map(manual_labels)
# Create an area graph for the 5 topics together
library(ggplot2)
count_by_year <- reticulate::py$df_count_by_year
ggplot(count_by_year, aes(x = year, y = proportion, fill = manual_labels)) +
geom_area() +
theme_minimal() +
labs(title = "Top 5 Topics in Presidential Speeches Over Time",
x = "Year",
y = "Proportion of Speeches",
fill = "Topic") +
scale_fill_viridis_d() +
theme(plot.title = element_text(size = 12,face='bold'),
legend.position = "bottom",
legend.text = element_text(size = 6))
# Create line graph for each topic/graph by itself
ggplot(count_by_year, aes(x = year, y = proportion)) +
geom_line(aes(color = manual_labels)) +
facet_wrap(~ manual_labels, scales = "free_y") +
theme_minimal() +
labs(title = "Top 5 Topics in Presidential Speeches Over Time",
x = "Year",
y = "Proportion of Speeches",
color = "Topic") +
scale_color_viridis_d() +
theme(
legend.position = "none",
plot.title = element_text(size = 10,face='bold')
)
Part 4: Cosine Similarity
Word Embedding
# Get embeddings for each chunk using the BERTopic model
topics, probs = topic_model.transform(df_chunked["transcript"].tolist())
embeddings = topic_model._extract_embeddings(df_chunked["transcript"].tolist(), method="document")

df_chunked["embedding"] = list(embeddings)
# Group the speech embeddings by president and average them to get one embedding per president
president_embedding = df_chunked.groupby("president")["embedding"].apply(
    lambda emb_list: np.mean(np.vstack(emb_list), axis=0)
)
# Apply cosine similarity to the averaged embeddings
from sklearn.metrics.pairwise import cosine_similarity

X = np.vstack(president_embedding.values)
cosine_similarity_matrix = cosine_similarity(X)
# Create a DataFrame of the cosine similarity scores between each pair of presidents.
# groupby sorts presidents alphabetically, so label rows and columns from the
# groupby index first, then reorder both chronologically for the heatmap
presidents = president_embedding.index.tolist()

similarity_df = pd.DataFrame(cosine_similarity_matrix, index=presidents, columns=presidents)
similarity_df = similarity_df.reindex(index=keep_presidents, columns=keep_presidents)
Word Embedding Visualization
# Plot a heatmap using the viridis color scale for easier visualization
library(reshape2)
library(viridis)

similarity_matrix <- reticulate::py$similarity_df
similarity_matrix <- as.matrix(similarity_matrix)
melted_matrix <- melt(similarity_matrix, varnames = c("president_1", "president_2"))
ggplot(melted_matrix, aes(x = president_1, y = president_2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_viridis(
    option = "viridis", # Try "magma", "plasma", or "inferno" for other easily visualizable variants
    direction = -1,
    limits = c(min(melted_matrix$value), max(melted_matrix$value))
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 10),
    legend.position = "right",
    plot.title = element_text(size = 10, face = 'bold')
  ) +
  labs(
    x = "President",
    y = "President",
    title = "Word Embedding - Cosine Similarity of Presidential Speeches",
    fill = "Similarity"
  ) +
  coord_fixed()
TF-IDF Cosine Similarity
# Import vectorizer and list of stopwords to clean our text
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Define function to remove stopwords from a text
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)
# Join all of each president's speeches into a single string
presidents_aggregated = df_proper.groupby('president')['cleaned_text'].apply(" ".join).reset_index()
# Cleans the text by removing stopwords
presidents_aggregated['cleaned_text'] = presidents_aggregated['cleaned_text'].apply(remove_stopwords)
# Vectorize the text with TF-IDF and save the document-term matrix separately
vectorizer = TfidfVectorizer()
president_dfm = vectorizer.fit_transform(presidents_aggregated['cleaned_text'])
# Apply cosine similarity and create a DataFrame; as above, label with the
# groupby (alphabetical) order first, then reorder chronologically
pres_cosine = cosine_similarity(president_dfm, president_dfm)

similarity_df_2 = pd.DataFrame(
    pres_cosine,
    index=presidents_aggregated['president'].tolist(),
    columns=presidents_aggregated['president'].tolist(),
)
similarity_df_2 = similarity_df_2.reindex(index=keep_presidents, columns=keep_presidents)
TF-IDF Visualization
# Create another heatmap in the same style as the previous one, using viridis for ease of visualization
similarity_matrix_2 <- reticulate::py$similarity_df_2
similarity_matrix_2 <- as.matrix(similarity_matrix_2)
melted_matrix_2 <- melt(similarity_matrix_2, varnames = c("president_1", "president_2"))

ggplot(melted_matrix_2, aes(x = president_1, y = president_2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_viridis(
    option = "viridis", # Try "magma", "plasma", or "inferno" for other easily visualizable variants
    direction = -1,
    limits = c(min(melted_matrix_2$value), max(melted_matrix_2$value))
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 10),
    legend.position = "right",
    plot.title = element_text(size = 10, face = 'bold')
  ) +
  labs(
    x = "President",
    y = "President",
    title = "TF-IDF Cosine Similarity of Presidential Speeches",
    fill = "Similarity"
  ) +
  coord_fixed()